library(ggplot2)
library(gridExtra)
library(GGally)
library(scales)
wqw <- read.csv("/Users/olivier/Desktop/Udacity/rstudio/assignment/wineQualityWhites.csv")
Number of rows
nrow(wqw)
## [1] 4898
Variables
names(wqw)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
summary(wqw)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Being very new to wine characteristics, I did some research on the variable names to understand their impact on the taste of wine.
X is the anonymized unique ID of each wine, so let's convert it to a factor.
wqw$X <- as.factor(wqw$X)
As our task is to identify the chemical properties that influence quality, let's look at quality first.
ggplot(aes(x=quality), data=wqw) + geom_histogram()
The quality scores are discrete integers, so let's set the binwidth to 1.
ggplot(aes(x=quality), data=wqw) + geom_histogram(binwidth=1)
The distribution of quality looks roughly normal, with a peak at 6.
summary(wqw$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
The minimum is 3, the maximum is 9, and the middle 50% of the values lie between 5 and 6.
table(wqw$quality)
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Quality values are very concentrated. Let's see what percentage of the sample each value represents.
prop.table(table(wqw$quality))
##
## 3 4 5 6 7 8
## 0.004083299 0.033278889 0.297468354 0.448754594 0.179665169 0.035728869
## 9
## 0.001020825
About 45% of the wines score 6, nearly half of the sample, so 6 looks like a very average value. Scores 3, 4 and 5 together account for around 33%, and scores 7, 8 and 9 for around 22%. We can use these three groups to categorize our wines.
wqw$quality.group <- cut(wqw$quality, labels=c('low', 'average', 'high'), breaks=c(0, 5, 6, 10))
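To check that the breaks put each score in the intended group, here is a minimal sketch that rebuilds the quality scores from the counts printed above instead of reading the CSV. The names `counts`, `quality` and `group` are only for illustration.

```r
# Quality counts copied from the table(wqw$quality) output above.
counts <- setNames(c(20, 163, 1457, 2198, 880, 175, 5), 3:9)
quality <- rep(as.integer(names(counts)), times = counts)

# cut() builds right-closed intervals by default: (0,5], (5,6], (6,10].
group <- cut(quality, breaks = c(0, 5, 6, 10),
             labels = c("low", "average", "high"))

group_counts <- table(group)
print(group_counts)
print(round(prop.table(group_counts), 3))
```

The three shares should come out close to the 33%, 45% and 22% estimated above.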
Fixed acidity is described as tartaric acid in the data description, and tartaric acid is one specific molecule. However, the documentation I read on fixed acidity describes it as a class of acids that includes both tartaric acid and citric acid. [http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity]
ggplot(aes(x=fixed.acidity), data=wqw) + geom_histogram()
Fixed acidity looks normally distributed, with a few outliers around 12 and 14. The middle half of the wines lies between 6.3 and 7.3.
Let's try a better bin size.
table(wqw$fixed.acidity)
##
## 3.8 3.9 4.2 4.4 4.5 4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5
## 1 1 2 3 1 1 5 9 7 24 23 28 27 28 31
## 5.6 5.7 5.8 5.9 6 6.1 6.15 6.2 6.3 6.4 6.45 6.5 6.6 6.7 6.8
## 71 88 121 103 184 155 2 192 188 280 1 225 290 236 308
## 6.9 7 7.1 7.15 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8 8.1 8.2
## 241 232 200 2 206 178 194 123 153 93 93 74 80 56 56
## 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9 9.1 9.2 9.3 9.4 9.5 9.6 9.7
## 52 35 32 25 15 18 16 17 6 21 3 11 2 5 4
## 9.8 9.9 10 10.2 10.3 10.7 11.8 14.2
## 8 2 3 1 2 2 1 1
The measurements have a precision of 0.1, apart from a handful of values recorded at 0.05 steps (6.15, 6.45, 7.15).
ggplot(aes(x=fixed.acidity), data=wqw) +
geom_histogram(binwidth=.1)
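The bin width above was read off the table by eye. As a rough alternative, the sketch below estimates the most frequent gap between consecutive distinct values. `typical_step` is a hypothetical helper, not part of any package, and the vector `x` is a small illustrative sample rather than the real column.

```r
# Hypothetical helper: most frequent gap between consecutive distinct values.
typical_step <- function(x, digits = 6) {
  # Round the gaps so that floating-point noise does not split equal steps.
  gaps <- round(diff(sort(unique(x))), digits)
  as.numeric(names(which.max(table(gaps))))
}

# Illustrative values: mostly 0.1 steps with one value at a 0.05 step.
x <- c(5.8, 5.9, 6.0, 6.1, 6.15, 6.2, 6.3, 6.4, 6.5)
step <- typical_step(x)
print(step)
```

On the real data this would be called as `typical_step(wqw$fixed.acidity)`, and the result is only a starting point for choosing `binwidth`.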
From online research, volatile acidity refers to the steam-distillable acids in the wine. Note that the US legal limit is 1.1 g/L. I assume that our data are in g/L. It is normally not detectable up to 3 g/L.
ggplot(aes(x=volatile.acidity), data=wqw) + geom_histogram()
The shape of the volatile acidity distribution approaches a normal distribution, with 75% of the wines below 0.32 g/L. There is also a tail toward higher values, with a few outliers around 0.9 g/L and a maximum of 1.1 g/L, right at the US legal limit.
Let's find a finer-grained bin size.
table(wqw$volatile.acidity)
##
## 0.08 0.085 0.09 0.1 0.105 0.11 0.115 0.12 0.125 0.13 0.135 0.14
## 4 1 1 6 6 13 3 34 3 44 1 56
## 0.145 0.15 0.155 0.16 0.165 0.17 0.175 0.18 0.185 0.19 0.2 0.205
## 4 88 5 141 2 140 1 177 5 170 214 4
## 0.21 0.215 0.22 0.225 0.23 0.235 0.24 0.245 0.25 0.255 0.26 0.265
## 191 1 229 4 216 4 253 4 231 10 240 5
## 0.27 0.275 0.28 0.285 0.29 0.295 0.3 0.305 0.31 0.315 0.32 0.325
## 218 3 263 5 160 3 198 4 148 4 182 2
## 0.33 0.335 0.34 0.345 0.35 0.355 0.36 0.365 0.37 0.375 0.38 0.385
## 134 7 135 9 86 1 104 2 65 2 63 2
## 0.39 0.395 0.4 0.405 0.41 0.415 0.42 0.425 0.43 0.435 0.44 0.445
## 61 2 59 1 54 4 36 2 35 2 46 4
## 0.45 0.455 0.46 0.47 0.475 0.48 0.485 0.49 0.495 0.5 0.51 0.52
## 25 2 30 15 3 17 3 14 2 14 10 10
## 0.53 0.54 0.545 0.55 0.555 0.56 0.57 0.58 0.585 0.59 0.595 0.6
## 8 10 1 14 2 9 4 7 2 4 2 7
## 0.61 0.615 0.62 0.63 0.64 0.65 0.655 0.66 0.67 0.68 0.685 0.69
## 7 4 5 2 7 2 3 4 5 3 1 2
## 0.695 0.705 0.71 0.73 0.74 0.75 0.76 0.78 0.785 0.815 0.85 0.905
## 3 2 1 1 1 1 2 1 1 1 1 1
## 0.91 0.93 0.965 1.005 1.1
## 1 1 1 1 1
It seems that the precision is 0.005.
ggplot(aes(x=volatile.acidity), data=wqw) + geom_histogram(binwidth=.005)
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
I have the impression that the measurements were not recorded consistently: we get many values at 0.0X and very few at 0.0X5. I will adopt a 0.01 bin size to smooth the plot.
ggplot(aes(x=volatile.acidity), data=wqw) + geom_histogram(binwidth=.01)
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
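To put a number on how rare the 0.0X5 readings are, one option is to flag the values whose third decimal is 5. This is only a sketch: `on_half_step` is a hypothetical helper and `x` is a small made-up sample, not the real column.

```r
# Hypothetical helper: TRUE when the third decimal of a value is 5.
# round() guards against floating-point noise such as 0.205 * 1000 = 205.00000000000003.
on_half_step <- function(x) round(x * 1000) %% 10 == 5

# Made-up sample mixing 0.01 steps with two 0.005 readings.
x <- c(0.20, 0.205, 0.21, 0.22, 0.225, 0.23, 0.24, 0.25)
half_step_share <- mean(on_half_step(x))
print(half_step_share)
```

Applied to `wqw$volatile.acidity`, a small share would support the choice of a 0.01 bin size.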
From my internet search, citric acid contributes to the fixed acidity. It is usually present at between 0 and 0.5 g/L in wine.
ggplot(aes(x=citric.acid), data=wqw) + geom_histogram()
Citric acid seems to follow a normal distribution with a peak at 0.3 g/L, again with a few outliers around 1.25 g/L and 1.7 g/L.
Let's find a finer-grained bin size.
table(wqw$citric.acid)
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 19 7 6 2 12 5 6 12 4 12 14 1 19 17 27
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 23 33 27 49 48 70 66 104 83 181 136 219 216 282 223
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 307 200 257 183 225 137 177 134 122 101 117 82 95 37 63
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 46 51 38 39 215 35 25 23 16 19 11 22 13 21 6
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 6 9 14 4 6 8 7 7 7 5 3 9 5 5 41
## 0.78 0.79 0.8 0.81 0.82 0.86 0.88 0.91 0.99 1 1.23 1.66
## 2 2 2 2 2 1 1 2 1 5 1 1
The data precision is 0.01.
ggplot(aes(x=citric.acid), data=wqw) +
geom_histogram(binwidth=.01)
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
table(subset(wqw, citric.acid == .49 | citric.acid == .74)$citric.acid)
##
## 0.49 0.74
## 215 41
The distribution seems normal except for a few outliers at 1.23 g/L and 1.66 g/L.
Note the two sharp peaks of concentration, at 0.49 g/L (215 wines) and 0.74 g/L (41 wines).
Instinctively, this looks like the result of a carefully controlled additive. Indeed, citric acid can be used to boost acidity and add “freshness” [http://en.wikipedia.org/wiki/Acids_in_wine#Citric_acid]. But one shouldn't add too much, as it gives the wine a strong citric flavor.
Let's create a categorical variable for those values of citric acid.
wqw$added.citric.acid <- ifelse(wqw$citric.acid == .49 | wqw$citric.acid ==.74, 'yes', 'no')
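The exact `==` comparison works here because the values come straight from the CSV. If the column were ever the result of arithmetic, floating-point rounding could make an exact match fail, so a tolerance-based comparison is a safer variant. The helper `near` and the vector `citric` below are assumptions made for illustration.

```r
# Hypothetical helper: compare with a small tolerance instead of exact equality.
near <- function(x, target, tol = 1e-8) abs(x - target) < tol

# Illustrative values; the fourth one is computed rather than typed in.
citric <- c(0.30, 0.49, 0.74, 0.25 + 0.24, 1.00)
flag <- ifelse(near(citric, 0.49) | near(citric, 0.74), "yes", "no")
print(data.frame(citric, flag))
```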
Residual sugar is the sugar that was not transformed during fermentation, in g/L.
ggplot(aes(x=residual.sugar), data=wqw) + geom_histogram()
We have a long-tailed distribution, but the bar reaching a count of 1500 is probably due to bins that are too wide. There are also a few outliers above 60 and around 30.
Let's look a bit more at the values.
table(wqw$residual.sugar)
##
## 0.6 0.7 0.8 0.9 0.95 1 1.05 1.1 1.15 1.2 1.25 1.3
## 2 7 25 39 4 93 1 146 3 187 3 147
## 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 1.8 1.85 1.9
## 2 184 4 142 2 165 2 99 1 99 3 59
## 1.95 2 2.05 2.1 2.2 2.25 2.3 2.35 2.4 2.5 2.6 2.65
## 2 79 1 51 56 2 42 1 41 40 33 1
## 2.7 2.8 2.85 2.9 3 3.1 3.15 3.2 3.3 3.4 3.5 3.6
## 38 36 1 25 17 17 1 28 23 13 31 22
## 3.7 3.75 3.8 3.85 3.9 3.95 4 4.1 4.2 4.25 4.3 4.35
## 12 2 21 3 17 3 19 17 31 2 19 1
## 4.4 4.45 4.5 4.55 4.6 4.7 4.75 4.8 4.85 4.9 5 5.1
## 14 3 33 2 40 29 5 38 1 35 43 28
## 5.15 5.2 5.25 5.3 5.35 5.4 5.45 5.5 5.55 5.6 5.7 5.8
## 2 29 4 17 2 23 2 13 1 16 30 23
## 5.85 5.9 5.95 6 6.1 6.2 6.3 6.35 6.4 6.5 6.55 6.6
## 2 19 1 23 21 31 39 1 34 26 1 30
## 6.65 6.7 6.75 6.8 6.85 6.9 6.95 7 7.05 7.1 7.2 7.25
## 3 25 1 28 6 20 1 31 2 36 29 2
## 7.3 7.35 7.4 7.45 7.5 7.6 7.7 7.75 7.8 7.85 7.9 7.95
## 19 2 40 1 30 29 34 2 41 1 32 1
## 8 8.1 8.15 8.2 8.25 8.3 8.4 8.45 8.5 8.55 8.6 8.65
## 32 34 1 36 2 31 13 1 24 1 27 1
## 8.7 8.75 8.8 8.9 8.95 9 9.05 9.1 9.15 9.2 9.25 9.3
## 18 2 22 23 1 18 1 17 2 22 2 11
## 9.4 9.5 9.55 9.6 9.65 9.7 9.8 9.85 9.9 10 10.05 10.1
## 10 9 1 18 4 22 16 3 18 18 3 14
## 10.2 10.3 10.4 10.5 10.55 10.6 10.65 10.7 10.8 10.9 11 11.1
## 23 16 25 16 1 22 1 26 17 11 19 18
## 11.2 11.25 11.3 11.4 11.45 11.5 11.6 11.7 11.75 11.8 11.9 11.95
## 18 2 12 14 1 11 15 8 4 35 16 3
## 12 12.05 12.1 12.15 12.2 12.3 12.4 12.5 12.55 12.6 12.7 12.75
## 16 1 21 4 15 13 19 16 2 16 16 1
## 12.8 12.85 12.9 13 13.1 13.15 13.2 13.3 13.4 13.5 13.55 13.6
## 25 4 25 19 23 1 13 16 7 10 3 12
## 13.65 13.7 13.8 13.9 14 14.05 14.1 14.15 14.2 14.3 14.35 14.4
## 4 21 8 18 16 1 4 1 20 17 3 17
## 14.45 14.5 14.55 14.6 14.7 14.75 14.8 14.9 14.95 15 15.1 15.15
## 3 17 3 13 14 2 12 14 2 13 7 1
## 15.2 15.25 15.3 15.4 15.5 15.55 15.6 15.7 15.75 15.8 15.9 16
## 6 1 9 17 11 6 14 9 1 6 2 10
## 16.05 16.1 16.2 16.3 16.4 16.45 16.5 16.55 16.6 16.65 16.7 16.75
## 6 2 7 7 5 1 3 1 2 5 5 2
## 16.8 16.85 16.9 16.95 17 17.05 17.1 17.2 17.3 17.35 17.4 17.45
## 4 4 3 3 1 1 5 9 14 1 2 2
## 17.5 17.55 17.6 17.7 17.75 17.8 17.85 17.9 17.95 18 18.05 18.1
## 8 3 2 1 4 13 5 2 3 2 3 6
## 18.15 18.2 18.3 18.35 18.4 18.5 18.6 18.75 18.8 18.9 18.95 19.1
## 8 3 2 4 1 1 1 4 3 1 3 1
## 19.25 19.3 19.35 19.4 19.45 19.5 19.6 19.8 19.9 19.95 20.15 20.2
## 3 4 1 2 3 2 1 4 1 3 1 2
## 20.3 20.4 20.7 20.8 22 22.6 23.5 26.05 31.6 65.8
## 1 1 2 2 2 1 1 2 2 1
No single value comes anywhere near a count of 1500 (the highest is 187), and 0.1 seems a suitable binwidth. Let's also trim the tail by keeping only the values below the 90th percentile.
ggplot(aes(x=residual.sugar),
data=subset(wqw, residual.sugar < quantile(residual.sugar, .9))) +
geom_histogram(binwidth=0.1)
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
It's a more acceptable visualization.
Let's see if we can get a normal distribution by taking the square root of the residual sugar.
ggplot(aes(x=sqrt(residual.sugar)), data=wqw) + geom_histogram(binwidth=0.1)
Well not very convincing…
Let's see with the log10 of the residual sugar.
ggplot(aes(x=log10(residual.sugar)), data=wqw) + geom_histogram()
That seems a bit better: we get a bimodal distribution.
As described on http://en.wikipedia.org/wiki/Sweetness_of_wine, wines are classified into categories by sweetness.
ggplot(aes(x=residual.sugar),
data=wqw) +
geom_histogram(binwidth=0.1) +
scale_x_continuous(breaks=c(4, 12, 45))
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
## Error: arguments imply differing number of rows: 666, 655
It seems that dry wines are the largest group… let's create a factor variable.
wqw$sweetness <- cut(wqw$residual.sugar, labels=c('dry', 'medium dry', 'medium', 'sweet'), breaks=c(0, 4, 12, 45, 500))
prop.table(table(wqw$sweetness))
##
## dry medium dry medium sweet
## 0.428133932 0.403225806 0.168436096 0.000204165
In our sample, about 43% of the wines are dry, 40% medium dry and 17% medium, and a single wine (0.02%) is sweet.
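Values that sit exactly on a break are worth a second look. The sketch below applies the same breaks to a few residual sugar values taken from the table above and shows that `cut()` closes each interval on the right, so 4 g/L counts as dry and 12 g/L as medium dry. The vector `sugar` is just a hand-picked sample.

```r
# A few residual sugar values (g/L), including the boundary values 4 and 12.
sugar <- c(0.6, 4, 4.1, 12, 12.05, 31.6, 65.8)

# Same breaks and labels as above; intervals are (0,4], (4,12], (12,45], (45,500].
sweetness <- cut(sugar, breaks = c(0, 4, 12, 45, 500),
                 labels = c("dry", "medium dry", "medium", "sweet"))
print(data.frame(sugar, sweetness))
```

If the boundary values should fall in the upper group instead, `right = FALSE` is the argument to look at.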
Chlorides measure the amount of salt in the wine, in g/L.
ggplot(aes(x=chlorides), data=wqw) + geom_histogram()
table(wqw$chlorides)
##
## 0.009 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019 0.02 0.021 0.022
## 1 1 1 4 4 5 5 10 9 16 19 19
## 0.023 0.024 0.025 0.026 0.027 0.028 0.029 0.03 0.031 0.032 0.033 0.034
## 20 34 30 54 58 85 81 108 107 109 119 168
## 0.035 0.036 0.037 0.038 0.039 0.04 0.041 0.042 0.043 0.044 0.045 0.046
## 130 200 160 167 157 182 147 184 141 201 170 181
## 0.047 0.048 0.049 0.05 0.051 0.052 0.053 0.054 0.055 0.056 0.057 0.058
## 171 174 133 170 115 104 130 99 61 88 68 53
## 0.059 0.06 0.061 0.062 0.063 0.064 0.065 0.066 0.067 0.068 0.069 0.07
## 36 46 19 25 23 15 8 18 18 7 18 6
## 0.071 0.072 0.073 0.074 0.075 0.076 0.077 0.078 0.079 0.08 0.081 0.082
## 5 2 5 8 2 9 1 2 4 4 2 2
## 0.083 0.084 0.085 0.086 0.087 0.088 0.089 0.09 0.091 0.092 0.093 0.094
## 5 5 3 4 3 2 1 2 1 3 3 5
## 0.095 0.096 0.097 0.098 0.099 0.102 0.104 0.105 0.108 0.11 0.112 0.114
## 2 6 1 3 1 1 1 1 2 3 1 1
## 0.115 0.117 0.118 0.119 0.12 0.121 0.122 0.123 0.126 0.127 0.13 0.132
## 1 3 1 3 1 2 1 4 3 2 1 1
## 0.133 0.135 0.136 0.137 0.138 0.142 0.144 0.145 0.146 0.147 0.148 0.149
## 1 1 1 2 2 3 1 1 1 2 1 1
## 0.15 0.152 0.154 0.156 0.157 0.158 0.16 0.167 0.168 0.169 0.17 0.171
## 1 2 1 1 4 1 2 2 3 2 2 1
## 0.172 0.173 0.174 0.175 0.176 0.179 0.18 0.184 0.185 0.186 0.194 0.197
## 2 2 2 2 2 1 1 2 2 1 1 2
## 0.2 0.201 0.204 0.208 0.209 0.211 0.212 0.217 0.239 0.24 0.244 0.255
## 1 2 1 2 1 1 1 1 1 1 1 1
## 0.271 0.29 0.301 0.346
## 1 1 1 1
A bin size of 0.001 seems appropriate. Let's also remove the top 1% of values.
ggplot(aes(x=chlorides),
data=subset(wqw, chlorides < quantile(chlorides,.99))) +
geom_histogram(binwidth=.001)
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
The graph now follows a normal distribution between 0.009 and 0.069. However, we still have a long tail from 0.08 up to 0.16.
Let's try without the 5% highest values.
ggplot(aes(x=chlorides),
data=subset(wqw, chlorides < quantile(chlorides,.95))) +
geom_histogram(binwidth=.001)
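The `subset(..., x < quantile(x, p))` pattern used throughout keeps only the values strictly below the chosen percentile. Here is a minimal sketch on a toy vector with one extreme value; `x` and `trimmed` are illustrative names and the data are made up.

```r
# Toy data: 1 to 99 plus one extreme value.
x <- c(1:99, 1000)

# Default quantile type in R interpolates between order statistics.
cutoff <- quantile(x, 0.95)

# Strict "<" drops everything at or above the cutoff, including the outlier.
trimmed <- x[x < cutoff]
print(cutoff)
print(range(trimmed))
```

Note that this removes a fixed share of the data whether or not those points are true outliers, so it is a plotting convenience rather than a cleaning step.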
Free sulfur dioxide is the free form of SO2, in mg/dm3, and works as a preservative. This molecule is easily detectable above 50 ppm.
ggplot(aes(x=free.sulfur.dioxide), data=wqw) + geom_histogram()
We have at least one outlier at around 300, and the bins are too wide, with a peak at a count of 1250.
table(wqw$free.sulfur.dioxide)
##
## 2 3 4 5 6 7 8 9 10 11 11.5 12
## 1 10 11 25 32 25 35 29 55 45 1 51
## 13 14 15 15.5 16 17 18 19 19.5 20 21 22
## 55 68 79 1 58 89 80 84 1 101 93 102
## 23 23.5 24 25 26 27 28 28.5 29 30 30.5 31
## 110 1 118 111 129 99 112 1 160 99 1 132
## 32 33 34 35 35.5 36 37 38 38.5 39 39.5 40
## 109 112 128 129 2 127 111 102 1 89 1 103
## 40.5 41 41.5 42 42.5 43 43.5 44 44.5 45 46 47
## 1 104 2 86 1 63 1 75 4 101 64 91
## 48 48.5 49 50 50.5 51 51.5 52 52.5 53 54 55
## 66 7 82 64 2 54 1 72 4 68 61 58
## 56 57 58 59 59.5 60 60.5 61 61.5 62 63 64
## 42 44 37 39 2 38 2 47 1 29 30 23
## 64.5 65 66 67 68 69 70 70.5 71 72 73 73.5
## 1 14 17 22 24 17 11 1 5 6 8 4
## 74 75 76 77 77.5 78 79 79.5 80 81 82 82.5
## 5 7 5 5 1 4 2 4 1 7 2 1
## 83 85 86 87 88 89 93 95 96 97 98 101
## 4 2 2 4 1 1 1 1 3 1 3 2
## 105 108 110 112 118.5 122.5 124 128 131 138.5 146.5 289
## 2 3 1 1 1 1 1 1 1 1 1 1
A bin size of 1 seems like it would work.
ggplot(aes(x=free.sulfur.dioxide),
data=subset(wqw, free.sulfur.dioxide < quantile(free.sulfur.dioxide, .99))) + geom_histogram(binwidth=1)
The distribution has a rather flattened normal shape.
Total sulfur dioxide is the total amount of SO2, in mg/dm3. It includes the free sulfur dioxide.
ggplot(aes(x=total.sulfur.dioxide), data=wqw) + geom_histogram()
Again, a few outliers and what seems like a normal distribution. Let's adjust the bin size and remove the outliers.
summary(wqw$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
table(wqw$total.sulfur.dioxide)
##
## 9 10 18 19 21 24 25 26 28 29 30 31
## 1 1 2 1 1 3 1 1 4 2 2 1
## 33 34 37 40 41 44 45 46 47 48 49 50
## 1 2 3 3 4 1 2 2 3 1 4 3
## 51 53 54 55 56 57 58 59 60 61 62 63
## 3 2 2 7 5 7 2 5 6 9 2 10
## 64 65 66 67 68 69 70 71 72 73 74 75
## 6 8 7 12 14 10 8 12 17 20 12 14
## 76 77 78 79 80 81 82 83 84 85 86 87
## 26 14 17 15 23 21 17 17 27 20 25 39
## 88 89 90 91 92 93 94 95 96 97 98 99
## 15 23 30 22 30 42 28 34 28 41 49 34
## 100 101 102 103 104 105 106 107 108 109 110 111
## 37 47 37 34 44 41 32 45 32 37 47 69
## 112 113 114 115 115.5 116 117 118 119 120 121 122
## 31 61 54 45 1 47 57 55 47 42 37 54
## 123 124 125 126 127 128 129 129.5 130 131 132 133
## 33 53 49 50 38 54 32 2 46 47 47 50
## 134 135 136 137 138 139 140 141 142 143 144 145
## 47 41 38 27 45 28 52 29 46 44 35 30
## 146 147 148 149 150 151 152 153 154 155 156 157
## 31 31 44 48 54 39 43 32 27 39 47 31
## 158 159 160 161 162 162.5 163 164 164.5 165 166 167
## 38 34 32 37 34 2 36 27 1 19 39 32
## 168 169 170 171 172 173 174 175 176 176.5 177 178
## 43 29 32 27 28 32 28 16 24 1 27 41
## 179 180 181 182 183 184 185 186 187 188 189 189.5
## 26 34 21 30 35 30 18 25 19 23 30 3
## 190 191 192 193 194 195 196 197 198 199 200 201
## 17 28 18 15 21 17 16 28 18 10 18 16
## 202 203 204 205 206 207 208 209 210 211 212 212.5
## 13 7 13 12 14 10 10 11 23 8 15 6
## 213 214 215 216 216.5 217 217.5 218 218.5 219 219.5 220
## 14 10 10 8 1 4 1 4 3 6 1 7
## 221 222 223 224 225 226 227 228 229 230 231 232
## 13 7 9 9 4 3 8 8 9 6 5 1
## 233 234 234.5 235 236 237 238 238.5 240 241 242 243
## 2 7 1 2 3 3 5 1 7 2 2 6
## 244 245 246 247 248 249 249.5 251 252 253 255 256
## 2 5 1 3 3 2 1 4 2 3 1 2
## 259 260 272 282 294 303 307.5 313 344 366.5 440
## 1 1 2 1 1 1 1 1 1 1 1
ggplot(aes(x=total.sulfur.dioxide),
data=subset(wqw, total.sulfur.dioxide < quantile(total.sulfur.dioxide, .99))) +
geom_histogram(binwidth=1)
The distribution is quite noisy; wider bins may attenuate this noise.
ggplot(aes(x=total.sulfur.dioxide),
data=subset(wqw, total.sulfur.dioxide < quantile(total.sulfur.dioxide, .99))) +
geom_histogram(binwidth=3)
One often sees “contains sulfites” on wine bottles because a small part of the population (less than 1%) is sulfite-sensitive. The label is required when the concentration is higher than 10 ppm. In the US the maximum authorized is 350 ppm. Sulfite content is also used as a criterion for organic wine, with a maximum of 100 ppm. [http://waterhouse.ucdavis.edu/whats-in-wine/sulfites-in-wine]
For liquids, 1 mg/L is approximately 1 ppm, so we can mark those thresholds on the graph.
ggplot(aes(x=total.sulfur.dioxide),
data=wqw) +
geom_histogram(binwidth=3) +
scale_x_continuous(breaks=c(10,100,350))
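The same thresholds can also be checked numerically. The sketch below counts how many values exceed each limit, treating mg/dm3 as roughly equal to ppm as assumed above. The vector `so2` holds only a handful of values picked from the table, and the threshold names are my own labels.

```r
# Thresholds in ppm (taken as approximately mg/L); the names are illustrative.
thresholds <- c(label_required = 10, organic_max = 100, us_max = 350)

# A few total sulfur dioxide values picked from the table above.
so2 <- c(9, 10, 18, 98, 134, 167, 344, 366.5, 440)

# Number of values strictly above each threshold.
above <- sapply(thresholds, function(limit) sum(so2 > limit))
print(above)
```

With the full column, `sum(wqw$total.sulfur.dioxide > 350)` gives the number of wines over the US limit.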
It seems that almost all our white wines would have to display “Contains Sulfites”. Still, a portion of them could be considered organic. Two wines in our sample would not be authorized in the US.
Apparently this 10 ppm threshold is more of a health issue than anything to do with wine quality, but let's still create a new variable contains.sulfites with five groups: up to 1, 1 to 10, 10 to 100, 100 to 350, and more than 350.
wqw$contains.sulfites <- cut(wqw$total.sulfur.dioxide, labels=c('no', 'negligible', 'low', 'normal', 'high'), breaks=c(0, 1, 10, 100, 350, 800))
prop.table(table(wqw$contains.sulfites))
##
## no negligible low normal high
## 0.0000000000 0.0004083299 0.1880359330 0.8111474071 0.0004083299
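Our sample contains about 19% of wines with low sulfites (10 to 100), 81% with normal levels (100 to 350), and only two wines each (0.04%) in the negligible and high groups.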
According to the practical winemaker journal [http://www.practicalwinery.com/janfeb09/page2.htm], the ratio between free SO2 and total SO2 is key for the preservation of the wine. So let's explore this ratio.
ggplot(aes(x=free.sulfur.dioxide/total.sulfur.dioxide), data=wqw) +
geom_histogram(binwidth=0.01)
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
We get a normal distribution of the ratio. Most of the values lie between 10% and 40%.
The article also mentions that for dry table wines the level of free sulfur is usually somewhere around 40% to 75% of the level of total SO2. Let's cross-check with our sample.
ggplot(aes(x=free.sulfur.dioxide/total.sulfur.dioxide),
data=subset(wqw, sweetness == 'dry')) +
geom_histogram(binwidth=0.01) +
scale_x_continuous(breaks=c(.4,.75))
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
Very few of our dry wines fall within the 40% to 75% range; most of them are below 40%. After reading the article multiple times and double-checking my variables, I cannot figure out why our sample's ratio is so different.
As this ratio seems important for wine conservation, let's add it as a variable, keeping in mind that we couldn't really validate our values.
wqw$ratio.sulfur.dioxide <- wqw$free.sulfur.dioxide/wqw$total.sulfur.dioxide
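To turn the visual comparison with the 40% to 75% range into a number, the share of wines inside that band can be computed directly. This is a sketch on made-up values: `free` and `total` are hypothetical measurements, not rows from the dataset.

```r
# Hypothetical free and total SO2 measurements (mg/dm3) for five wines.
free  <- c(30, 45, 14, 60, 35)
total <- c(120, 100, 132, 140, 170)

ratio <- free / total

# TRUE when the ratio falls inside the 40% to 75% band quoted in the article.
in_band <- ratio >= 0.40 & ratio <= 0.75
print(round(ratio, 3))
print(mean(in_band))
```

On the real data the equivalent would be `mean(in_band)` computed from `wqw$ratio.sulfur.dioxide`, optionally restricted to the dry wines.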
ggplot(aes(x=density), data=wqw) + geom_histogram()
Let's look at the summary and the table to choose appropriate binwidth.
summary(wqw$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
table(wqw$density)
##
## 0.98711 0.98713 0.98722 0.9874 0.98742 0.98746 0.98758 0.98774
## 1 1 1 1 2 2 1 1
## 0.98779 0.98794 0.98802 0.98815 0.98816 0.98819 0.98822 0.98823
## 1 2 1 1 1 1 1 1
## 0.988245 0.98834 0.98836 0.9884 0.98845 0.98853 0.98854 0.98856
## 1 1 2 1 1 1 1 2
## 0.9886 0.98862 0.98865 0.98867 0.98868 0.98869 0.9887 0.98871
## 2 3 3 2 1 1 2 2
## 0.98872 0.98876 0.98878 0.9888 0.98882 0.98883 0.98884 0.98886
## 2 2 1 1 1 1 1 2
## 0.98889 0.9889 0.98892 0.98894 0.98895 0.98896 0.98898 0.989
## 3 5 5 3 2 2 1 4
## 0.98902 0.98904 0.98906 0.9891 0.98912 0.98913 0.98914 0.98915
## 1 1 2 2 5 1 5 1
## 0.98916 0.98918 0.98919 0.9892 0.98922 0.98923 0.98924 0.98926
## 4 4 1 7 1 1 3 6
## 0.98928 0.9893 0.98931 0.989315 0.98934 0.98935 0.98936 0.98938
## 2 8 2 1 5 1 5 2
## 0.98939 0.9894 0.98941 0.98942 0.989435 0.98944 0.98945 0.98946
## 2 6 1 5 1 7 2 7
## 0.989465 0.98947 0.98948 0.98949 0.9895 0.98951 0.98952 0.98953
## 1 1 2 4 6 1 7 3
## 0.98954 0.98956 0.98958 0.98959 0.9896 0.98961 0.98962 0.98963
## 3 5 4 3 9 5 2 6
## 0.98964 0.98965 0.98966 0.98968 0.9897 0.98972 0.98974 0.98975
## 9 1 3 4 6 4 3 3
## 0.98976 0.98978 0.9898 0.98981 0.98982 0.98984 0.98985 0.98986
## 2 1 18 3 1 6 2 3
## 0.98987 0.98988 0.9899 0.98992 0.98993 0.98994 0.98995 0.98997
## 2 4 8 1 4 3 1 1
## 0.98998 0.98999 0.99 0.99001 0.99002 0.99004 0.99005 0.99006
## 4 3 28 2 4 4 1 3
## 0.99007 0.99008 0.99009 0.9901 0.99011 0.99012 0.99013 0.99014
## 1 4 1 8 1 3 1 5
## 0.99015 0.99016 0.99018 0.99019 0.9902 0.99021 0.99022 0.99024
## 1 4 5 1 20 6 5 3
## 0.99026 0.99027 0.99028 0.9903 0.99031 0.99032 0.99033 0.99034
## 10 1 3 13 4 3 3 1
## 0.99035 0.99036 0.99037 0.99038 0.9904 0.99041 0.99042 0.99043
## 7 8 1 3 13 1 2 5
## 0.99044 0.99045 0.99046 0.99047 0.99048 0.9905 0.99051 0.99052
## 7 4 2 3 4 9 1 4
## 0.99053 0.99054 0.99055 0.99056 0.99057 0.99058 0.99059 0.9906
## 2 2 1 3 3 9 2 32
## 0.99061 0.99062 0.99063 0.99064 0.99065 0.99066 0.99067 0.99068
## 1 5 1 4 1 7 5 3
## 0.99069 0.9907 0.99071 0.99072 0.99074 0.99075 0.99076 0.99077
## 2 13 1 3 5 2 15 1
## 0.99078 0.99079 0.9908 0.99081 0.99082 0.99084 0.99085 0.99086
## 2 1 25 1 4 8 6 3
## 0.99088 0.99089 0.9909 0.99091 0.99092 0.99093 0.99094 0.99095
## 5 7 13 2 4 2 5 3
## 0.99096 0.99097 0.99098 0.99099 0.991 0.99102 0.99103 0.99104
## 4 2 4 2 34 1 1 4
## 0.99105 0.99106 0.99107 0.99108 0.99109 0.9911 0.99111 0.99112
## 2 2 1 3 2 25 5 6
## 0.99114 0.99115 0.99116 0.99117 0.99118 0.99119 0.9912 0.99121
## 8 1 5 1 2 3 33 3
## 0.99122 0.99123 0.99124 0.99125 0.99126 0.99127 0.99128 0.99129
## 3 4 3 3 8 1 4 3
## 0.9913 0.99132 0.99133 0.99134 0.99135 0.99136 0.99137 0.99138
## 16 6 2 5 2 3 1 9
## 0.99139 0.9914 0.99142 0.99143 0.99144 0.99146 0.99148 0.9915
## 2 39 7 3 8 7 4 10
## 0.99151 0.99152 0.99153 0.99154 0.99155 0.99156 0.99157 0.99158
## 4 5 3 5 3 3 1 6
## 0.99159 0.9916 0.99161 0.99162 0.99163 0.99164 0.99165 0.99166
## 4 23 2 4 1 11 6 7
## 0.99167 0.99168 0.9917 0.99171 0.99172 0.99173 0.99174 0.99175
## 1 7 34 1 5 4 7 3
## 0.99176 0.99177 0.99178 0.99179 0.9918 0.99182 0.99183 0.99184
## 11 2 8 1 40 6 1 12
## 0.99185 0.99186 0.99188 0.99189 0.9919 0.99192 0.99193 0.99194
## 5 6 8 4 14 5 2 4
## 0.99195 0.99196 0.99198 0.99199 0.992 0.99201 0.99202 0.99203
## 2 4 6 2 64 1 5 1
## 0.99204 0.99205 0.99206 0.99207 0.99208 0.99209 0.9921 0.99211
## 4 1 4 4 3 1 16 1
## 0.99212 0.99214 0.99215 0.99216 0.99218 0.9922 0.99221 0.99222
## 13 3 6 8 4 27 2 3
## 0.99223 0.99224 0.99225 0.99226 0.99228 0.99229 0.9923 0.99232
## 1 6 3 9 5 1 19 4
## 0.99234 0.99235 0.99236 0.99237 0.99238 0.99239 0.9924 0.99241
## 6 4 1 2 7 2 44 2
## 0.99242 0.99243 0.99244 0.99245 0.99246 0.99248 0.99249 0.9925
## 3 3 8 2 3 3 2 22
## 0.99251 0.99252 0.99253 0.99254 0.99255 0.99256 0.99257 0.99258
## 1 3 1 5 3 5 2 1
## 0.9926 0.99261 0.99262 0.99264 0.99265 0.99266 0.99267 0.99268
## 32 1 2 1 2 5 1 6
## 0.99269 0.9927 0.99271 0.99272 0.99273 0.99274 0.99275 0.99276
## 2 47 3 5 2 4 1 3
## 0.99278 0.99279 0.9928 0.99281 0.99282 0.99283 0.99284 0.99286
## 8 1 61 1 3 2 1 4
## 0.99287 0.99288 0.99289 0.9929 0.99293 0.99294 0.99295 0.99296
## 3 5 1 20 3 1 1 7
## 0.99297 0.99298 0.99299 0.993 0.99302 0.99304 0.99305 0.99306
## 4 1 4 52 1 10 4 5
## 0.99307 0.99308 0.99309 0.9931 0.99311 0.99312 0.99313 0.99314
## 3 3 1 28 1 4 3 6
## 0.99315 0.99316 0.99317 0.99318 0.99319 0.9932 0.99321 0.99322
## 3 4 1 4 1 53 3 3
## 0.99323 0.99324 0.99325 0.99326 0.99328 0.99329 0.9933 0.99331
## 1 6 1 7 3 1 17 2
## 0.99332 0.99334 0.99335 0.99336 0.99338 0.99339 0.9934 0.99341
## 4 5 5 3 6 2 50 1
## 0.99342 0.99344 0.99345 0.99346 0.99347 0.99348 0.9935 0.99352
## 1 4 2 2 4 3 17 5
## 0.99353 0.99354 0.99355 0.99356 0.99358 0.9936 0.99361 0.99362
## 1 4 1 3 3 34 1 14
## 0.99364 0.99365 0.99366 0.99367 0.99368 0.9937 0.99372 0.99373
## 4 3 4 1 5 35 1 5
## 0.99374 0.99375 0.99376 0.99378 0.99379 0.9938 0.99381 0.99382
## 2 2 1 3 1 49 1 8
## 0.99383 0.99384 0.99385 0.99386 0.99388 0.9939 0.99391 0.99392
## 2 2 1 1 6 28 2 3
## 0.99393 0.99394 0.99395 0.99396 0.99397 0.99398 0.99399 0.994
## 1 4 1 4 4 5 1 37
## 0.99402 0.99403 0.99404 0.99405 0.99406 0.99407 0.99408 0.9941
## 4 1 2 3 6 1 4 25
## 0.99411 0.99412 0.99413 0.99414 0.99415 0.99416 0.99418 0.9942
## 3 2 1 2 1 3 3 38
## 0.99422 0.99424 0.99425 0.99426 0.99427 0.99428 0.99429 0.9943
## 3 2 4 2 1 5 3 11
## 0.99432 0.99433 0.99434 0.99435 0.99436 0.99437 0.99438 0.99439
## 6 1 4 2 1 3 7 1
## 0.9944 0.99441 0.99442 0.99444 0.99445 0.99449 0.9945 0.99452
## 46 2 2 3 6 4 22 4
## 0.99453 0.99454 0.99455 0.99456 0.99457 0.99458 0.99459 0.9946
## 1 7 5 4 1 5 2 32
## 0.99461 0.99462 0.99463 0.99464 0.99466 0.99468 0.99469 0.9947
## 2 3 1 1 2 2 4 9
## 0.99471 0.99472 0.99473 0.99474 0.99475 0.99476 0.99477 0.99478
## 6 3 2 6 4 1 1 4
## 0.99479 0.9948 0.99481 0.99482 0.99485 0.99486 0.99488 0.99489
## 4 45 2 3 1 3 4 3
## 0.9949 0.99492 0.99494 0.99495 0.99496 0.99497 0.99498 0.99499
## 20 2 6 2 4 2 2 1
## 0.995 0.99502 0.99504 0.99505 0.99506 0.99507 0.99508 0.99509
## 25 4 1 3 1 1 4 2
## 0.9951 0.99511 0.99512 0.99513 0.99514 0.99516 0.99517 0.99518
## 25 1 7 2 5 4 2 4
## 0.99519 0.9952 0.99521 0.99522 0.99523 0.99524 0.99526 0.99527
## 3 37 2 1 2 3 3 2
## 0.99528 0.9953 0.99532 0.99534 0.99535 0.99536 0.99537 0.99538
## 4 32 4 5 2 3 4 4
## 0.99539 0.9954 0.99541 0.99542 0.99543 0.99544 0.99545 0.99546
## 1 44 2 8 2 7 4 8
## 0.99548 0.9955 0.99551 0.99552 0.99553 0.99554 0.99555 0.99556
## 4 30 4 3 1 1 2 7
## 0.99558 0.9956 0.99561 0.99562 0.99563 0.99564 0.99565 0.99566
## 8 41 2 3 1 6 1 5
## 0.99567 0.99568 0.9957 0.99571 0.99572 0.99573 0.99574 0.99576
## 3 4 15 4 5 3 1 7
## 0.99577 0.99578 0.99579 0.9958 0.99581 0.99582 0.99583 0.99584
## 2 6 3 40 4 5 1 2
## 0.99585 0.99586 0.99587 0.99588 0.9959 0.99591 0.99592 0.99594
## 1 3 8 2 23 1 4 3
## 0.99595 0.99596 0.996 0.99601 0.99602 0.99604 0.99605 0.99606
## 1 5 20 1 2 7 2 2
## 0.99608 0.9961 0.99611 0.99612 0.99615 0.99616 0.9962 0.99622
## 1 16 1 4 1 2 31 6
## 0.99624 0.99625 0.99626 0.99627 0.99628 0.99629 0.9963 0.99632
## 1 1 3 2 6 1 18 2
## 0.99634 0.99636 0.9964 0.99642 0.99644 0.99645 0.99646 0.9965
## 1 3 18 7 2 1 1 18
## 0.99652 0.99654 0.99655 0.99656 0.99657 0.99658 0.99659 0.9966
## 4 3 3 1 4 2 3 36
## 0.99662 0.99663 0.99665 0.99666 0.99668 0.99669 0.9967 0.99672
## 2 2 3 9 1 1 13 3
## 0.99674 0.99675 0.99676 0.99677 0.99678 0.99679 0.9968 0.99681
## 1 2 5 1 5 1 20 1
## 0.99683 0.99684 0.99685 0.99687 0.99688 0.9969 0.99691 0.99692
## 1 3 2 2 2 15 5 5
## 0.99693 0.99695 0.99696 0.99698 0.99699 0.997 0.99702 0.99704
## 2 1 1 1 5 21 1 3
## 0.99705 0.99706 0.99708 0.99709 0.9971 0.99711 0.99712 0.99713
## 8 2 2 1 11 4 1 1
## 0.99714 0.99715 0.99716 0.99718 0.9972 0.99724 0.99725 0.99726
## 1 1 1 4 33 2 1 4
## 0.99727 0.99728 0.9973 0.99732 0.99734 0.99736 0.99737 0.99738
## 3 2 7 3 1 3 1 1
## 0.9974 0.99741 0.99742 0.99745 0.99748 0.9975 0.99751 0.99752
## 28 1 6 2 2 17 1 3
## 0.99754 0.99755 0.99756 0.99758 0.9976 0.99767 0.99769 0.9977
## 6 3 3 2 34 2 1 12
## 0.99771 0.99772 0.99773 0.99775 0.99776 0.99778 0.99779 0.9978
## 2 4 7 2 4 2 1 23
## 0.99782 0.99784 0.99785 0.99786 0.99787 0.99788 0.9979 0.99792
## 5 8 1 2 2 1 24 10
## 0.99794 0.99795 0.998 0.99801 0.99802 0.99803 0.99804 0.99805
## 4 2 35 1 2 2 2 2
## 0.99806 0.99807 0.99808 0.9981 0.99814 0.99815 0.9982 0.99822
## 1 8 8 15 1 5 28 3
## 0.99824 0.99825 0.99827 0.998275 0.99828 0.9983 0.99831 0.99833
## 1 8 2 1 2 21 1 1
## 0.99834 0.99835 0.99836 0.998365 0.99837 0.99838 0.99839 0.9984
## 4 4 2 1 1 2 3 29
## 0.99841 0.99845 0.99848 0.9985 0.99851 0.99853 0.99855 0.99856
## 1 1 3 4 2 1 7 1
## 0.99858 0.9986 0.99862 0.99863 0.99864 0.99865 0.99869 0.9987
## 1 42 5 2 1 2 2 9
## 0.99872 0.99873 0.9988 0.99882 0.99884 0.99886 0.99888 0.9989
## 2 1 10 2 8 2 3 5
## 0.99896 0.99898 0.99899 0.999 0.99902 0.99904 0.99906 0.99907
## 4 3 1 11 1 3 5 5
## 0.99908 0.9991 0.99911 0.99916 0.99918 0.9992 0.99922 0.99924
## 2 9 3 2 1 5 5 4
## 0.9993 0.99935 0.99936 0.99938 0.9994 0.99941 0.99942 0.99943
## 6 1 1 1 9 1 2 1
## 0.99944 0.99945 0.99946 0.99947 0.9995 0.99954 0.99955 0.99956
## 1 3 7 2 3 3 1 2
## 0.9996 0.99965 0.99966 0.9997 0.99971 0.99975 0.99976 0.9998
## 8 1 1 6 2 3 6 17
## 0.99985 0.9999 1 1.0001 1.00013 1.00014 1.00016 1.0002
## 1 9 19 11 2 2 1 7
## 1.00022 1.0003 1.00037 1.00038 1.0004 1.00044 1.00047 1.0005
## 1 3 2 2 9 2 1 2
## 1.00051 1.00055 1.0006 1.0007 1.0008 1.00098 1.001 1.0011
## 1 1 4 1 3 1 5 2
## 1.00118 1.0012 1.0017 1.00182 1.00196 1.0024 1.00241 1.00295
## 1 1 2 1 1 1 1 2
## 1.0103 1.03898
## 2 1
It's a pretty difficult choice, as the precision of the measurements goes down to 0.00001.
ggplot(aes(x=density),
data=subset(wqw, density < quantile(density, .99))) +
geom_histogram(binwidth=.00001)
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
The data is very noisy. Let's increase the bin size.
ggplot(aes(x=density),
data=subset(wqw, density < quantile(density,.99))) +
geom_histogram(binwidth=.0001)
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
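A rough count of bins helps explain the noise. The sketch below divides the full density range (from the summary above) by each bin width and compares the result with the number of wines. It ignores the 99th-percentile trimming used in the plots, so the figures are only indicative, and the variable names are illustrative.

```r
# Range taken from summary(wqw$density); sample size from nrow(wqw).
span    <- 1.0390 - 0.9871
n_wines <- 4898

binwidths <- c(0.00001, 0.0001)

# round() avoids off-by-one results caused by floating-point division.
n_bins <- round(span / binwidths)

print(data.frame(binwidth = binwidths,
                 n_bins = n_bins,
                 wines_per_bin = round(n_wines / n_bins, 1)))
```

With the finest bin width there are more bins than wines, which is why the first histogram looks so jagged.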
The density distribution seems roughly normal, with what look like three modes.
pH is an indicator of how acidic or basic the wine is.
ggplot(aes(x=pH), data=wqw) + geom_histogram()
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
No outliers here, but let's see if we can adjust the binwidth.
Let's see the values.
table(wqw$pH)
##
## 2.72 2.74 2.77 2.79 2.8 2.82 2.83 2.84 2.85 2.86 2.87 2.88 2.89 2.9 2.91
## 1 1 1 3 3 1 4 1 9 9 9 11 17 31 15
## 2.92 2.93 2.94 2.95 2.96 2.97 2.98 2.99 3 3.01 3.02 3.03 3.04 3.05 3.06
## 18 38 35 26 63 32 41 68 74 49 68 78 97 89 115
## 3.07 3.08 3.09 3.1 3.11 3.12 3.13 3.14 3.15 3.16 3.17 3.18 3.19 3.2 3.21
## 79 136 92 135 126 134 117 172 136 164 124 138 145 137 95
## 3.22 3.23 3.24 3.25 3.26 3.27 3.28 3.29 3.3 3.31 3.32 3.33 3.34 3.35 3.36
## 146 116 132 114 96 88 87 82 93 79 86 49 79 48 83
## 3.37 3.38 3.39 3.4 3.41 3.42 3.43 3.44 3.45 3.46 3.47 3.48 3.49 3.5 3.51
## 49 58 40 39 30 48 20 33 17 28 21 21 23 15 14
## 3.52 3.53 3.54 3.55 3.56 3.57 3.58 3.59 3.6 3.61 3.62 3.63 3.64 3.65 3.66
## 17 13 14 9 8 5 5 6 7 3 1 6 2 4 5
## 3.67 3.68 3.69 3.7 3.72 3.74 3.75 3.76 3.77 3.79 3.8 3.81 3.82
## 1 2 2 1 3 2 2 2 2 1 2 1 1
There is a 0.01 precision on the measurements.
ggplot(aes(x=pH), data=wqw) + geom_histogram(binwidth=0.01)
The pH seems to follow a normal distribution.
summary(wqw$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The mean (3.188) and the median (3.180) are nearly identical. All our white wines are acidic, with pH values between 2.72 and 3.82.
Sulphates (or potassium sulphate) are a wine additive used as an antimicrobial and antioxidant. Potassium sulphate can also be used as a fertilizer [http://www.solufeed.co.uk/solufeed-news/articles/2013/august/foliar-potassium-enhances-wine-quality.aspx].
ggplot(aes(x=sulphates), data=wqw) + geom_histogram()
Let's look at the value granularity
table(wqw$sulphates)
##
## 0.22 0.23 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37
## 1 1 4 4 13 13 16 31 35 54 59 84 85 120 129
## 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52
## 214 151 168 139 181 161 216 178 225 172 179 166 249 140 156
## 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67
## 135 167 102 108 83 99 97 88 45 68 48 67 28 36 35
## 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82
## 44 30 27 18 33 12 19 22 19 16 19 16 5 5 13
## 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.92 0.94 0.95 0.96 0.97 0.98 0.99
## 2 4 3 2 2 7 1 5 2 2 5 3 1 6 1
## 1 1.01 1.06 1.08
## 1 1 1 1
A bin width of 0.01 seems to fit.
ggplot(aes(x=sulphates), data=wqw) + geom_histogram(binwidth=0.01)
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
The curve looks roughly normal and bimodal. From the table we can find a peak at 0.38 and another at 0.5. We can also more clearly spot a few outliers above 1.0 g/L.
summary(wqw$sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Half of the values lie between 0.41 g/L and 0.55 g/L (the first and third quartiles).
Alcohol is quite self-explanatory: it is expressed as a percentage by volume. 11.6% is considered a global average.
ggplot(aes(x=alcohol), data=wqw) + geom_histogram()
Let's try to adjust our bin size.
table(wqw$alcohol)
##
## 8 8.4 8.5 8.6
## 2 3 9 23
## 8.7 8.8 8.9 9
## 78 107 95 185
## 9.1 9.2 9.3 9.4
## 144 199 134 229
## 9.5 9.53333333333333 9.55 9.6
## 228 3 2 128
## 9.63333333333333 9.7 9.73333333333333 9.75
## 1 105 2 1
## 9.8 9.9 10 10.0333333333333
## 136 109 162 1
## 10.1 10.1333333333333 10.15 10.2
## 114 2 3 130
## 10.3 10.4 10.4666666666667 10.5
## 85 153 2 160
## 10.5333333333333 10.55 10.5666666666667 10.6
## 1 2 1 114
## 10.65 10.7 10.8 10.9
## 1 96 135 88
## 10.9333333333333 10.9666666666667 10.98 11
## 2 3 1 158
## 11.05 11.0666666666667 11.1 11.2
## 2 1 83 112
## 11.2666666666667 11.3 11.3333333333333 11.35
## 1 101 3 1
## 11.3666666666667 11.4 11.4333333333333 11.45
## 1 121 1 4
## 11.4666666666667 11.5 11.55 11.6
## 1 88 1 46
## 11.6333333333333 11.65 11.7 11.7333333333333
## 2 1 58 1
## 11.75 11.8 11.85 11.9
## 2 60 1 53
## 11.94 11.95 12 12.05
## 2 1 102 1
## 12.0666666666667 12.1 12.15 12.2
## 1 51 2 86
## 12.25 12.3 12.3333333333333 12.4
## 1 62 1 68
## 12.5 12.6 12.7 12.75
## 83 63 56 3
## 12.8 12.8933333333333 12.9 13
## 54 2 39 36
## 13.05 13.1 13.1333333333333 13.2
## 1 18 1 14
## 13.3 13.4 13.5 13.55
## 7 20 12 1
## 13.6 13.7 13.8 13.9
## 9 7 2 3
## 14 14.05 14.2
## 5 1 1
Most values are recorded with a precision of 0.1; a few have finer fractional values (for example 9.5333).
ggplot(aes(x=alcohol), data=wqw) + geom_histogram(binwidth=0.1)
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
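The table of alcohol values shows mostly one-decimal readings plus a few finer fractions. A small sketch of how one might quantify that before settling on a bin width follows; `share_on_grid` and the toy vector are hypothetical names and values, not taken from the dataset.

```r
# Hypothetical helper: share of values that sit on a grid of the given step
share_on_grid <- function(x, step, tol = 1e-8) {
  scaled <- x / step
  # A value is "on the grid" when x / step is an integer up to a small tolerance
  mean(abs(scaled - round(scaled)) < tol)
}

# Toy values mixing one-decimal readings with a few finer ones (not the real data)
toy_alcohol <- c(9.5, 9.5, 9.6, 10.1, 11.0, 9.53333333333333, 10.15, 12.8)
on_grid <- share_on_grid(toy_alcohol, 0.1)
on_grid
```

If the share is close to 1, a bin width equal to the step is a reasonable starting point; the off-grid values simply fall into the nearest bin.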
There are 4,898 white wines in the dataset with 13 variables:
Main observations:
The most important feature is the quality. For the rest of the features, it's not easy at this stage to clearly identify which ones are really important. A good wine is a well-balanced composition that doesn't seem to depend on one particular chemical component.
It is still difficult to identify which features will help, but the density, the alcohol, the sulfur dioxide and the volatile acidity (the vinegar taste) might be the most helpful.
I created 3 categorical variables and 1 continuous variable.
The first categorical variable is sweetness. The residual.sugar has been used to categorize the wines.
The second categorical variable is contains.sulfites. It's more of a regulatory label than a taste category, but it could be interesting.
The third categorical variable is added.citric.acid, a boolean that marks the wines with a non-normal concentration of citric acid.
The continuous variable is ratio.sulfur.dioxide, the ratio of free.sulfur.dioxide to total.sulfur.dioxide.
The residual sugar had a long-tailed distribution. After a log10 transformation it became a bimodal normal distribution. I didn't change the values but will keep this property of the distribution in mind.
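As a rough sketch of how derived variables of this kind can be built in base R, the example below works on a toy data frame with made-up values. The sweetness cut points (4, 12 and 45 g/L) and the label names are assumptions chosen for illustration; the thresholds actually used in this report may differ.

```r
# Toy rows with the same column names as the dataset (values are made up)
toy <- data.frame(
  residual.sugar       = c(1.2, 6.5, 20.0, 65.8),
  free.sulfur.dioxide  = c(30, 45, 20, 60),
  total.sulfur.dioxide = c(120, 150, 100, 200)
)

# Assumed sweetness cut points in g/L; intervals are closed on the right
toy$sweetness <- cut(toy$residual.sugar,
                     breaks = c(0, 4, 12, 45, Inf),
                     labels = c("dry", "medium dry", "medium", "sweet"))

# Ratio of free to total sulfur dioxide
toy$ratio.sulfur.dioxide <- toy$free.sulfur.dioxide / toy$total.sulfur.dioxide

# log10 view of residual sugar, stored separately so the original values stay unchanged
toy$log10.residual.sugar <- log10(toy$residual.sugar)
toy
```

Keeping the log10 values in their own column matches the choice described above: the original measurements are left untouched and the transformation is only used when looking at the distribution.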
set.seed(231)
sample.ids <- sample(levels(wqw$X), 2000)
ggpairs(subset(wqw, X %in% sample.ids )[,2:18])
According to the matrix, density and residual.sugar have a strong correlation of 0.83. Let's visualize it in a scatterplot.
ggplot(aes(x=density, y=residual.sugar),
data=wqw) +
geom_point(alpha=.2)
It looks like a linear relationship.
ggplot(aes(x=density, y=residual.sugar),
data=subset(wqw, residual.sugar<30)) +
geom_point(alpha=.5) +
geom_smooth(method="lm")
Well, we have a strong relationship and it definitely makes sense. Indeed, the more sugar is dissolved in a liquid, the denser the liquid becomes.
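To put a number on a linear trend like this one, a correlation coefficient and a fitted slope are the usual companions of the scatterplot. The sketch below uses simulated data, not the wine dataset; the intercept, slope and noise level are assumptions chosen only to mimic the shape of the plot.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible
n <- 200

# Simulated sugar (g/L) and a density that rises linearly with it, plus noise
sim_sugar   <- runif(n, min = 1, max = 20)
sim_density <- 0.991 + 0.0004 * sim_sugar + rnorm(n, sd = 0.0005)

# Pearson correlation and least-squares slope of density on sugar
r     <- cor(sim_density, sim_sugar)
fit   <- lm(sim_density ~ sim_sugar)
slope <- unname(coef(fit)["sim_sugar"])
c(correlation = r, slope = slope)
```

The slope reads as the change in density per additional gram of sugar per litre, which is easier to interpret than the correlation coefficient alone.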
A second strong correlation is between the alcohol and the density, at -0.78. Let's create a scatterplot to explore this relationship.
ggplot(aes(x=alcohol, y=density),
data=subset(wqw, density < 1.01)) +
geom_jitter(alpha=.2)
ggplot(aes(x=alcohol, y=density),
data=subset(wqw, density < 1.01)) +
geom_jitter(alpha=.2) +
geom_smooth(method="lm")
The alcohol and the density seem to follow a linear relationship, which definitely makes sense as the density of alcohol is lower than that of water (which is 1). The higher the alcohol concentration, the lower the density.
A third correlation is a moderate 0.53 between the total sulfur dioxide and the density.
ggplot(aes(x=total.sulfur.dioxide, y=density),
data=subset(wqw, density < 1.01)) +
geom_jitter(alpha=.2)
The scatterplot is not very convincing. It looks like a weak relationship.
Between the total sulfur dioxide and the residual sugar, there is a moderate correlation coefficient of 0.47. Let's have a closer look.
ggplot(aes(x=total.sulfur.dioxide, y=log10(residual.sugar)),
data=subset(wqw, residual.sugar < quantile(residual.sugar, .99))) +
geom_jitter(alpha=.2)
It is a bit difficult to get any information from this graph. An additional variable might be useful here.
A moderate positive correlation of 0.43 was spotted in the matrix between the quality and the level of alcohol.
ggplot(aes(x=as.factor(quality), y=alcohol), data=wqw) +
geom_boxplot()
It looks like the good wines of our sample have more alcohol. On average, higher-quality wines contain more alcohol than the average wines. Note that the average-quality wines have a lower alcohol level than the worst-quality wines.
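The comparison read off the boxplot can also be expressed as one number per quality group. Below is a minimal sketch with `tapply`, using a toy data frame whose values are invented to echo the pattern described (they are not rows from the dataset).

```r
# Toy data: a few wines per quality grade (made-up values)
toy <- data.frame(
  quality = c(3, 3, 5, 5, 5, 6, 6, 8, 8),
  alcohol = c(10.4, 10.6, 9.4, 9.5, 9.8, 10.3, 10.7, 12.0, 12.4)
)

# Median alcohol per quality grade, the same statistic a boxplot draws as its middle line
median_by_quality <- tapply(toy$alcohol, toy$quality, median)
median_by_quality
```

Swapping `median` for `mean` gives the group averages instead; the median is less sensitive to the outliers that show up in the middle grades.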
Another moderate negative correlation, of -0.45, exists between the pH and the fixed acidity.
ggplot(aes(x=pH, y=fixed.acidity),
data=wqw) +
geom_jitter(alpha=.2) +
geom_smooth(method='lm')
We clearly see that the higher the fixed acidity, the lower the pH. This makes total sense, as a lower pH means a more acidic wine.
The total and the free sulfur dioxide have a correlation coefficient of 0.61. Let's investigate further.
ggplot(aes(x=total.sulfur.dioxide, y=free.sulfur.dioxide), data=wqw) +
geom_point(alpha=.2) +
coord_cartesian(xlim=c(0,300)) +
geom_smooth(method='lm')
The relationship looks linear, which in a way makes sense as free sulfur dioxide is part of the total sulfur dioxide. Let's now plot the relationship between total minus free and free.
ggplot(aes(x=total.sulfur.dioxide - free.sulfur.dioxide, y=free.sulfur.dioxide), data=wqw) +
geom_point(alpha=.2) +
coord_cartesian(xlim=c(0,300)) +
geom_smooth(method='lm')
Well, not very conclusive: the point cloud suggests a rather low correlation.
cor.test(x = wqw$total.sulfur.dioxide - wqw$free.sulfur.dioxide,
y = wqw$free.sulfur.dioxide)
##
## Pearson's product-moment correlation
##
## data: wqw$total.sulfur.dioxide - wqw$free.sulfur.dioxide and wqw$free.sulfur.dioxide
## t = 19.1158, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2372821 0.2894077
## sample estimates:
## cor
## 0.2635373
Only a weak 0.26 correlation coefficient.
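Part of the gap between 0.61 and 0.26 is structural: a total is always correlated with each of its parts, even when the parts are unrelated to each other. The simulation below is a sketch of that effect under assumed uniform ranges; the variable names and ranges are hypothetical and not fitted to the dataset.

```r
set.seed(7)  # arbitrary seed so the simulation is reproducible
n <- 2000

# Two independent components and their sum
sim_free  <- runif(n, 10, 60)
sim_bound <- runif(n, 50, 150)
sim_total <- sim_free + sim_bound

# Part vs. part: close to zero because the components are independent
cor_bound <- cor(sim_bound, sim_free)
# Whole vs. part: clearly positive, only because free is included in total
cor_total <- cor(sim_total, sim_free)
c(bound_vs_free = cor_bound, total_vs_free = cor_total)
```

This is why looking at total minus free, as done above, is the more informative comparison.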
I read that the sulfur dioxide ratio influences the pH. Let's check if we get something.
ggplot(aes(x=pH, y=ratio.sulfur.dioxide), data=wqw) +
geom_point(alpha=.2)
Well, it looks like a correlation of 0. Definitely not related.
Let's compare pH across quality groups.
ggplot(aes(x=as.factor(quality), y=pH), data=wqw) +
geom_boxplot()
The very best wines have a very narrow pH range, as opposed to the worst wines, which are more spread out and have a lower (more acidic) pH. There are many more outliers for average-quality wines (5 and 6), but those make up the vast majority of our sample. Quality 5 has the lowest mean pH.
Chloride (or salt) is a great taste enhancer, so let's see its relationship with quality.
ggplot(aes(x=as.factor(quality), y=chlorides), data=wqw) +
geom_boxplot()
Well, at this scale the best wines don't clearly show a lower level of chlorides. We can again spot many outliers for qualities 5 and 6. Let's zoom in to get more details.
ggplot(aes(x=as.factor(quality), y=chlorides), data=wqw) +
geom_boxplot() +
coord_cartesian(ylim=c(.01, .075))
The better the wine, the lower the chloride level, except for the worst wines, graded 3 and 4. Are they not even worth a bit of chloride?
Too much volatile acidity is supposed to give the wine a vinegar smell. Let's see if the worst wines are the ones with a vinegar smell.
ggplot(aes(x=as.factor(quality), y=volatile.acidity), data=wqw) +
geom_boxplot()
Actually, the worst wines (quality 3) don't have the highest level of volatile acidity. However, the wines of quality 4 have the highest average concentration and a few high outliers.
Let's compare density across quality groups.
ggplot(aes(x=as.factor(quality), y=density), data=wqw) +
geom_boxplot() +
coord_cartesian(ylim = c(.985, 1))
The best wines (quality 7, 8 and 9) have on average a lower density.
ggplot(aes(x=quality, y=density), data=wqw) +
geom_jitter(alpha=.1) +
coord_cartesian(ylim = c(.985, 1))
Let's compare total sulfur dioxide according to quality groups
ggplot(aes(x=as.factor(quality), y=total.sulfur.dioxide), data=wqw) +
geom_boxplot()
Interesting plot: the better the quality, the narrower the variation of total sulfur dioxide. It's as if the best wine producers are more in control of the sulfur dioxide and don't let it vary much.
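The "narrower variation" seen in the boxes corresponds to the interquartile range of each group. Here is a small sketch of how that spread could be tabulated per quality grade; the toy values are invented to show the mechanics and are not taken from the dataset.

```r
# Toy data: five wines in each of two quality grades (made-up values, in mg/L)
toy <- data.frame(
  quality = rep(c(5, 8), each = 5),
  total.sulfur.dioxide = c(80, 110, 140, 170, 220,
                           100, 115, 125, 135, 150)
)

# Interquartile range per grade: the height of each box in the boxplot
iqr_by_quality <- tapply(toy$total.sulfur.dioxide, toy$quality, IQR)
iqr_by_quality
```

A smaller interquartile range for the higher grade is the numeric counterpart of the tighter box; on the real data the very small groups (quality 3 and 9) would need to be read with caution.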
Let's see how the wines with those non-normal levels of citric acid are rated in quality.
ggplot(aes(x=as.factor(quality), y=citric.acid), data=wqw) +
geom_jitter(aes(color=added.citric.acid), alpha=.2) +
coord_cartesian(ylim=c(0,1))
ggplot(aes(x=citric.acid), data=wqw) +
geom_histogram(binwidth=.01) +
facet_wrap(~ quality, scales='free')
Well, the non-normal concentrations show up for qualities 4, 5, 6 and 7. For 3, 8 and 9 you cannot spot a peak at 0.49 or 0.74.
On one hand, two features have a positive effect on the density: the sugar and the total sulfur dioxide. On the other hand, the alcohol has a negative effect on the density.
The best white wines have a low density and a high alcohol level. Therefore a wine producer should maximise the fermentation to consume most of the residual sugar and make as much alcohol as possible.
The free and total sulfur dioxide were correlated because the latter contains the former. The difference between the total and the free sulfur dioxide is called bound sulfur dioxide [http://www.practicalwinery.com/janfeb09/page2.htm]. In our sample the bound and free sulfur dioxide only have a weak (0.26) correlation coefficient.
The variation of total sulfur dioxide and pH across quality seems to tell the story that the producers who make better wine are more in control of the sulfur dioxide and the pH.
The strongest relationship was between the density and the residual sugar, which are strongly positively correlated. A strong negative correlation also exists between the alcohol and the density.
The bivariate plots exposed the relationship between density, residual.sugar and alcohol. Let's get a better feeling for it.
ggplot(aes(x=density, y=residual.sugar, color=alcohol),
data=subset(wqw, density < quantile(density, .99))) +
geom_jitter() +
scale_y_continuous(trans=log10_trans())
We can clearly see that, for a given residual sugar, the density gets lower as the alcohol gets higher. When the residual sugar increases, the alcohol is lower.
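The opposite pulls of sugar and alcohol on density can be separated with a two-predictor linear fit. The sketch below runs on simulated data only; the coefficients used to generate it are assumptions picked to mimic the direction of the effects, so the output illustrates the method rather than any result about the wines.

```r
set.seed(11)  # arbitrary seed so the simulation is reproducible
n <- 300

# Simulated predictors over ranges similar to those seen in the summaries
sim <- data.frame(
  residual.sugar = runif(n, 1, 20),
  alcohol        = runif(n, 8, 14)
)

# Assumed generating model: sugar raises density, alcohol lowers it
sim$density <- 1.0 + 0.0004 * sim$residual.sugar -
  0.001 * sim$alcohol + rnorm(n, sd = 0.0005)

# Fit density on both predictors at once and inspect the signs
fit   <- lm(density ~ residual.sugar + alcohol, data = sim)
coefs <- coef(fit)
coefs
```

Each coefficient is the effect of one variable with the other held fixed, which is the same reading as following one colour band across the plot.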
ggplot(aes(x=density, y=residual.sugar, color=alcohol),
data=subset(wqw, density < quantile(density, .99))) +
geom_jitter() +
scale_y_continuous(trans=log10_trans()) +
facet_wrap(~ quality)
The better wines (7 to 9) have on average a lower residual sugar and a higher alcohol concentration. The worst wines (3 to 5) don't have a lot of alcohol. The average wines (6) show both characteristics.
p1 <- ggplot(aes(x=density, y=residual.sugar, color=alcohol),
data=subset(wqw, density < quantile(density, .99))) +
geom_jitter() +
scale_y_continuous(trans=log10_trans()) +
ggtitle("all wines")
p2 <- ggplot(aes(x=density, y=residual.sugar, color=alcohol),
data=subset(wqw, density < quantile(density, .99) & quality == 6)) +
geom_jitter() +
scale_y_continuous(trans=log10_trans()) +
ggtitle("quality 6 wines")
grid.arrange(p1,p2)
The average wines (6) are a good subset to represent those 2 characteristics.
Let's have a look again at the total.sulfur.dioxide vs residual.sugar. Maybe by adding quality as color it would help us identify a pattern.
ggplot(aes(x=total.sulfur.dioxide, y=log10(residual.sugar), color=quality),
data=subset(wqw, residual.sugar < quantile(residual.sugar, .99))) +
geom_jitter()
Well, not really helpful.
In continuity with the previous graphs, let's see how our sweetness variable can be used.
ggplot(aes(x=density, y=alcohol, color=sweetness), data=wqw) +
geom_jitter(alpha=.8)
I like this plot as it connects to my past experience with different wine sweetness levels.
ggplot(aes(x=density, y=total.sulfur.dioxide, color=alcohol),
data=subset(wqw,
density < quantile(density, .99) &
total.sulfur.dioxide < quantile(total.sulfur.dioxide, .99))) +
geom_jitter()
In this previous plot we rather see a relationship between density and alcohol.
ggplot(aes(x=density, y=total.sulfur.dioxide, color=alcohol),
data=subset(wqw,
density < quantile(density, .99) &
total.sulfur.dioxide < quantile(total.sulfur.dioxide, .99))) +
geom_jitter() +
facet_wrap(~ quality)
Well, those last two plots are not really helping us in our exploration. Let's drop the sulfur dioxide and look from the angle of the contains.sulfites variable.
ggplot(aes(x=contains.sulfites, y=pH, color=as.factor(quality)),
data=subset(wqw, fixed.acidity < quantile(fixed.acidity,.99))) +
geom_jitter(alpha=.5)
ggplot(aes(x=contains.sulfites, y=alcohol, color=as.factor(quality)),
data=subset(wqw, fixed.acidity < quantile(fixed.acidity,.99))) +
geom_jitter(alpha=.2)
Well, the only insight I get from this graph is that the low-sulfite wines seem to be of better quality on average. Let's go back to a simple boxplot.
ggplot(aes(x=as.factor(quality), y=total.sulfur.dioxide),
data=wqw) +
geom_boxplot()
Back to square one with the understanding and visualisation of the sulfur dioxide. I'm a bit clueless. Let's try with pH.
ggplot(aes(x=pH, y=total.sulfur.dioxide, color=as.factor(quality)),
data=subset(wqw, density < quantile(density,.99))) +
geom_jitter(alpha=.5)
Seems like another dead end.
I was looking at “Which chemical properties influence the quality of white wines?”.
From my exploratory data analysis, it appears that the pH, the residual sugar, the density, the chlorides and the alcohol can help us identify a good wine. The lower the residual sugar, the chlorides and the density, and the higher the pH and the alcohol, the better the wine.
The alcohol concentration is a good approximation of the quality of the wine as it illustrates that the fermentation process was well done and very little residual sugar is left in the bottle.
However, a good wine appears to be the right balance of many chemical properties, which prevents me from identifying a linear model.
Regarding citric acid, it seems to be an additive commonly used across all qualities of wine. I would have expected that good-quality wines would not rely on such an additive. Also, I still need to find an official confirmation, but the European Union might not allow this additive.
No, I failed to identify or transform my variables to support a linear model.
ggplot(aes(x=density, y=alcohol, color=sweetness),
data=wqw) +
geom_jitter(alpha=.8) +
xlab("Density (g/cm^3)") +
ylab("Alcohol (%)") +
scale_color_discrete(name="Sweetness") +
geom_vline(xintercept=1, linetype="dotted") +
ggtitle("Wine Alcohol vs Density by Sweetness")
It would be a nice plot for a wider audience. It allows us to compare how the different levels of sweetness relate to the alcohol and the density. The dry wines can go quite high in percentage of alcohol and would feel lighter, as opposed to a medium white wine, which still contains quite some residual sugar and would feel heavier in the mouth, closer to water (density of 1).
wqw$quality <- factor(wqw$quality, levels= rev(levels(as.factor(wqw$quality))))
ggplot(aes(x=citric.acid, fill=as.factor(quality), order=as.numeric(quality)),
data=subset(wqw, citric.acid < quantile(citric.acid,.999))) +
geom_histogram(binwidth=.01) +
ylab("Number of wines") +
xlab("Citric Acid (g/L)") +
scale_fill_discrete(name="Quality") +
ggtitle("Wine Bottles' concentration in Citric Acid")
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
This second plot shows the peaks of citric acid concentration at 0.49 g/L and 0.74 g/L. I was quite shocked to see that even the better wine producers were using this technique in Portugal, even though the European Union might not allow it.
ggplot(aes(x=density, y=residual.sugar, color=alcohol),
data=subset(wqw, density < quantile(density, .99))) +
geom_jitter() +
scale_y_continuous(trans=log10_trans()) +
xlab("Density (g/cm^3)") +
ylab("Residual Sugar (g/L)") +
scale_color_continuous(name="Alcohol (%)") +
ggtitle("Residual Sugar vs Density vs Alcohol")
This third plot illustrates the balance between sugar and alcohol in setting the density of the wine. The bottles with high sugar have a low percentage of alcohol. A more complete fermentation process lowers the residual sugar, increases the alcohol and, as a result, lowers the density. The shape of the plot is logarithmic.
The exercises during the Udacity lesson 3 were much easier than figuring out a direction without guidance for this project. One has to go step by step. Even with a reasonable number of variables (around 17 here) it was very difficult for me not to get lost. I had to move back and forth in this report to correct wrong conclusions or to move plots from the univariate section to the bivariate or trivariate section.
Another big source of struggle was matching the dataset's variables with other information that I could find online. The names sulfates, sulfur and sulfite were a great source of confusion. To add to the naming confusion, some online searches provided very different averages, for example for ratio.sulfur.dioxide. After reading multiple sources online and coming back to the dataset description, I slowly learnt the different components, but I'm still a bit puzzled by the difference in averages.
The small success was to rediscover something that I already knew (the relationship between sugar, density and alcohol), but mostly the feeling of success came when I selected the right graph for my purpose. I easily got stuck in the analysis with certain types of graph. For example, I couldn't find a way out with scatterplots and histograms until I got the idea of using boxplots, which made a lot of relationships clearer. I also liked adding sweetness as a variable, which helped me connect with the subject.
The next steps for further analyzes would be to